Results 1 - 20 of 36
1.
Nucleic Acids Res ; 50(D1): D988-D995, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34791404

ABSTRACT

Ensembl (https://www.ensembl.org) is unique in its flexible infrastructure for access to genomic data and annotation. It has been designed to efficiently deliver annotation at scale for all eukaryotic life, and it also provides deep, comprehensive annotation for key species. Genomes representing a greater diversity of species are increasingly being sequenced. In response, we have focussed our recent efforts on expediting the annotation of new assemblies. Here, we report the release of the greatest annual number of newly annotated genomes in the history of Ensembl via our dedicated Ensembl Rapid Release platform (http://rapid.ensembl.org). We have also developed a new method to generate comparative analyses at scale for these assemblies and, for the first time, we have annotated non-vertebrate eukaryotes. Meanwhile, we continually improve, extend and update the annotation for our high-value reference vertebrate genomes and report the details here. We provide a range of software tools for specific tasks, such as the Ensembl Variant Effect Predictor (VEP) and the newly developed interface for the Variant Recoder. All Ensembl data, software and tools are freely available for download and are accessible programmatically.
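As a brief illustration of the programmatic access mentioned above, the sketch below queries the public Ensembl REST service for a gene record. The endpoint and gene symbol come from the service's public documentation and are used here only as an example; they are not part of the cited article.

```python
# Minimal sketch: fetching a gene record from the Ensembl REST API.
# The endpoint and gene symbol are illustrative; see https://rest.ensembl.org
# for the current documentation.
import requests

server = "https://rest.ensembl.org"
endpoint = "/lookup/symbol/homo_sapiens/BRCA2"  # example gene symbol

response = requests.get(server + endpoint,
                        headers={"Content-Type": "application/json"},
                        timeout=30)
response.raise_for_status()

gene = response.json()
# Typical fields include the stable ID, assembly coordinates and biotype.
print(gene.get("id"), gene.get("seq_region_name"),
      gene.get("start"), gene.get("end"), gene.get("biotype"))
```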


Subject(s)
Databases, Genetic , Genome/genetics , Molecular Sequence Annotation , Software , Animals , Computational Biology/classification , Humans
2.
São Paulo; s.n; s.n; 2022. 186 p. tab, graf, ilus.
Thesis in Portuguese | LILACS | ID: biblio-1397348

ABSTRACT



The methodological and instrumental advances resulting from the Human Genome Project created the framework necessary for the emergence of next-generation DNA sequencing technologies, which are characterized by reduced cost, low operational demand and the generation of a large volume of data per experiment. Concomitantly, the increase in computational processing power has driven the development of large-scale genetic analyses, making it possible to study individualized genomic features that had previously been explored little or not at all. Among these features, those related to structural variation in genomes have received considerable attention. Processed pseudogenes, or retrocopies, are structural variants caused by the duplication of coding genes through the transposition of their mature messenger RNA by the LINE-1 enzymatic machinery. Retrocopies can be fixed (i.e., present in all genomes of a given species and included in the reference genome assembly) or unfixed, being polymorphic, germline or somatic. However, knowledge about unfixed retrocopies is still limited due to the lack of bioinformatics tools dedicated to their identification and annotation in DNA sequencing data. Therefore, this work presents sideRETRO, a computer program specialized in the detection of processed pseudogenes that are absent from the reference genome but present in whole-genome and exome sequencing data from other individuals. In addition to detecting unfixed retrocopies, sideRETRO is able to annotate several other characteristics of these events, such as the genomic coordinate of the processed pseudogene insertion (chromosome, insertion point and DNA strand, leading or lagging), the genomic context of the event (exonic, intronic or intergenic), the genotype (present or absent) and the zygosity (homozygous or heterozygous). To assess the tool's efficiency, sideRETRO was run on simulated data and on real data experimentally validated by an independent group. In summary, this thesis describes the development and use of sideRETRO, a robust and efficient computational tool designed to identify and annotate unfixed processed pseudogenes. Finally, it is worth noting that sideRETRO fills a methodological gap and enables new hypotheses and systematic investigations in the field of structural variant calling.
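To make the genomic-context annotation concrete, here is a small, hypothetical sketch of how an insertion site can be classified as exonic, intronic or intergenic against a gene annotation. It illustrates the concept only and is not sideRETRO's actual code; the gene, exon coordinates and positions are invented.

```python
# Illustrative sketch (not sideRETRO's code): classifying the genomic context
# of a retrocopy insertion site against a toy gene annotation.
from typing import List, Tuple

Gene = Tuple[str, int, int, List[Tuple[int, int]]]  # (name, start, end, exons)

ANNOTATION: List[Gene] = [
    ("GENE_A", 1_000, 5_000, [(1_000, 1_200), (2_500, 2_800), (4_700, 5_000)]),
]

def classify_insertion(position: int, genes: List[Gene]) -> str:
    """Return 'exonic', 'intronic' or 'intergenic' for a 1-based position."""
    for _name, start, end, exons in genes:
        if start <= position <= end:
            if any(ex_start <= position <= ex_end for ex_start, ex_end in exons):
                return "exonic"
            return "intronic"
    return "intergenic"

print(classify_insertion(2_600, ANNOTATION))  # exonic
print(classify_insertion(3_000, ANNOTATION))  # intronic
print(classify_insertion(9_000, ANNOTATION))  # intergenic
```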


Subject(s)
Polymorphism, Genetic/genetics , Computational Biology/classification , Computational Biology/instrumentation , Costs and Cost Analysis , Genomics/instrumentation , Sequence Analysis, DNA/instrumentation , Clinical Coding
3.
Database (Oxford) ; 2020, 2020 01 01.
Article in English | MEDLINE | ID: mdl-32294192

ABSTRACT

Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents relevant to their interests. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within the documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation. We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, in support of the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012-2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, F-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task. Moreover, our classifier's performance is significantly improved by utilizing information from image captions compared with using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation. Database URL.
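For readers unfamiliar with the evaluation metrics quoted above, the sketch below shows how precision, recall, F-measure and the Matthews correlation coefficient are computed from confusion-matrix counts. The counts are arbitrary placeholders, not the GXD results.

```python
# Standard classification metrics computed from confusion-matrix counts.
# The counts below are arbitrary placeholders.
from math import sqrt

tp, fp, fn, tn = 80, 30, 20, 870  # true/false positives and negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"F={f_measure:.3f} MCC={mcc:.3f}")
```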


Subject(s)
Biomedical Research/statistics & numerical data , Computational Biology/methods , Data Curation/methods , Databases, Factual , Animals , Biomedical Research/classification , Biomedical Research/methods , Computational Biology/classification , Data Mining/methods , Humans , Internet
4.
IEEE Trans Neural Netw Learn Syst ; 31(8): 2857-2867, 2020 08.
Article in English | MEDLINE | ID: mdl-31170082

ABSTRACT

In the post-genome era, many problems in bioinformatics have arisen due to the generation of large amounts of imbalanced data. In particular, the computational classification of precursor microRNAs (pre-miRNAs) involves a high imbalance between the classes. For this task, a classifier is trained to identify RNA sequences having the highest chance of being miRNA precursors. The central difficulty is that known pre-miRNAs are few in comparison with the hundreds of thousands of candidate sequences in a genome, which results in highly imbalanced data. This imbalance has a strong influence on most standard classifiers and, if not properly addressed, prevents the classifier from working properly in a real-life scenario. This work provides a comparative assessment of recent deep neural architectures for dealing with large class imbalance in the classification of pre-miRNAs. We present and analyze recent architectures in a benchmark framework with genomes of animals and plants, with increasing imbalance ratios up to 1:2000. We also propose a new graphical way of comparing classifier performance in the context of high class imbalance. The comparative results show that, at very high imbalance, deep belief neural networks can provide the best performance.
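As a simple illustration of the class-imbalance problem described above, the sketch below trains a class-weighted logistic regression on synthetic data with a heavily skewed class ratio. It is a generic example of imbalance handling, not one of the deep architectures benchmarked in the article; the data are random.

```python
# Illustrative sketch of class imbalance handling with class weighting.
# Synthetic data; the skewed positive/negative ratio is the point.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
n_pos, n_neg = 50, 50_000            # heavily imbalanced classes
X = np.vstack([rng.normal(1.0, 1.0, (n_pos, 20)),
               rng.normal(0.0, 1.0, (n_neg, 20))])
y = np.array([1] * n_pos + [0] * n_neg)

# 'balanced' reweights classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
prec, rec, f1, _ = precision_recall_fscore_support(
    y, clf.predict(X), average="binary")
print(f"precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```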


Subject(s)
Computational Biology/classification , Computational Biology/methods , Databases, Factual/classification , Deep Learning/classification , Neural Networks, Computer , Plants/classification , Animals , Elasticity , Humans
6.
ScientificWorldJournal ; 2014: 179105, 2014.
Article in English | MEDLINE | ID: mdl-25276846

ABSTRACT

This paper analyses the effect of the effort distribution along the software development lifecycle on the prevalence of software defects. The analysis is based on data collected by the International Software Benchmarking Standards Group (ISBSG) on the development of 4,106 software projects. Data mining techniques have been applied to gain a better understanding of the behaviour of the project activities and to identify a link between the effort distribution and the prevalence of software defects. For exploratory purposes, the analysis has been complemented with a hierarchical clustering algorithm that uses a dissimilarity measure based on the likelihood ratio statistic. As a result, different behaviours have been identified in this collection of software development projects, allowing for the definition of risk control strategies to diminish the number and impact of software defects. It is expected that the use of similar estimations might greatly improve project managers' awareness of the risks at hand.
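The sketch below illustrates hierarchical clustering over a precomputed dissimilarity matrix, which is the general mechanism the paragraph describes; the matrix values are arbitrary stand-ins rather than the ISBSG likelihood-ratio dissimilarities.

```python
# Hierarchical clustering over a precomputed dissimilarity matrix
# (a stand-in for the likelihood-ratio-based dissimilarity mentioned above).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Symmetric dissimilarity matrix for 4 hypothetical projects.
D = np.array([[0.0, 0.2, 0.9, 0.8],
              [0.2, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.1],
              [0.8, 0.9, 0.1, 0.0]])

Z = linkage(squareform(D), method="average")   # linkage expects condensed form
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2]
```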


Subject(s)
Algorithms , Software , Cluster Analysis , Computational Biology/classification , Computational Biology/methods , Data Mining/classification , Data Mining/methods , Discriminant Analysis , Reproducibility of Results , Software Design , Software Validation
9.
BMC Bioinformatics ; 14: 350, 2013 Dec 03.
Article in English | MEDLINE | ID: mdl-24299119

ABSTRACT

BACKGROUND: Drosophila melanogaster has been established as a model organism for investigating developmental gene interactions. The spatio-temporal gene expression patterns of Drosophila melanogaster can be visualized by in situ hybridization and documented as digital images. Automated and efficient tools for analyzing these expression images will provide biological insights into gene functions, interactions, and networks. To facilitate pattern recognition and comparison, many web-based resources have been created to conduct comparative analysis based on body-part keywords and the associated images. With the fast accumulation of images from high-throughput techniques, manual inspection of images imposes a serious impediment on the pace of biological discovery. It is thus imperative to design an automated system for efficient image annotation and comparison. RESULTS: We present a computational framework to perform anatomical keyword annotation for Drosophila gene expression images. A spatial sparse coding approach is used to represent local image patches and is compared with the well-known bag-of-words (BoW) method. Three pooling functions, namely max pooling, average pooling and Sqrt (square root of mean squared statistics) pooling, are employed to transform the sparse codes into image features. Based on the constructed features, we develop both an image-level scheme and a group-level scheme to tackle the key challenges in annotating Drosophila gene expression pattern images automatically. To deal with the imbalanced data distribution inherent in image annotation tasks, an undersampling method is applied together with majority voting. Results on Drosophila embryonic expression pattern images verify the efficacy of our approach. CONCLUSION: In our experiments, the three pooling functions perform comparably well in feature dimension reduction. Undersampling with majority voting is shown to be effective in tackling the problem of imbalanced data. Moreover, combining sparse coding with the image-level scheme leads to consistent performance improvements in keyword annotation.
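The three pooling functions named above are simple reductions over a matrix of sparse codes. The sketch below applies them to random placeholder codes (rows are local patches, columns are dictionary atoms); it illustrates the operations only, not the article's full pipeline.

```python
# The three pooling functions applied to a matrix of sparse codes.
# The random codes are placeholders.
import numpy as np

rng = np.random.default_rng(42)
codes = rng.random((50, 128)) * (rng.random((50, 128)) < 0.1)  # sparse codes

max_pool = codes.max(axis=0)                      # max pooling
avg_pool = codes.mean(axis=0)                     # average pooling
sqrt_pool = np.sqrt((codes ** 2).mean(axis=0))    # sqrt of mean squared stats

print(max_pool.shape, avg_pool.shape, sqrt_pool.shape)  # (128,) each
```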


Subject(s)
Drosophila melanogaster/cytology , Drosophila melanogaster/genetics , Gene Expression Regulation, Developmental , Genome, Insect/genetics , Models, Genetic , Molecular Sequence Annotation/methods , Animals , Cell Differentiation/genetics , Cell Division/genetics , Computational Biology/classification , Computational Biology/methods , Drosophila melanogaster/embryology , Gene Expression Profiling/classification , Gene Expression Profiling/methods , High-Throughput Screening Assays , Molecular Sequence Annotation/classification , Predictive Value of Tests , Support Vector Machine
10.
Hum Mutat ; 34(1): 200-9, 2013 Jan.
Article in English | MEDLINE | ID: mdl-22949379

ABSTRACT

Mismatch repair (MMR) gene sequence variants of uncertain clinical significance are often identified in suspected Lynch syndrome families, and this constitutes a challenge for both researchers and clinicians. Multifactorial likelihood model approaches provide a quantitative measure of MMR variant pathogenicity, but first require input of likelihood ratios (LRs) for different MMR variation-associated characteristics from appropriate, well-characterized reference datasets. Microsatellite instability (MSI) and somatic BRAF tumor data for unselected colorectal cancer probands of known pathogenic variant status were used to derive LRs for tumor characteristics using the Colon Cancer Family Registry (CFR) resource. These tumor LRs were combined with variant segregation within families and with estimates of the prior probability of pathogenicity based on sequence conservation and position, to analyze 44 unclassified variants initially identified in Australasian Colon CFR families. In addition, in vitro splicing analyses were conducted on a subset of variants chosen on the basis of bioinformatic splicing predictions. The LR in favor of pathogenicity was estimated to be ~12-fold for a colorectal tumor with a BRAF mutation-negative MSI-H phenotype. For 31 of the 44 variants, the posterior probabilities of pathogenicity were such that altered clinical management would be indicated. Our findings provide a working multifactorial likelihood model for classification that carefully considers the mode of ascertainment for gene testing.
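The core of a multifactorial likelihood calculation of this kind is a Bayesian update on the odds scale: the prior probability of pathogenicity is converted to odds, multiplied by the component likelihood ratios, and converted back to a posterior probability. The sketch below shows that arithmetic with purely hypothetical numbers; it is not the study's calibrated model.

```python
# Bayesian odds update: posterior P(pathogenic) from a prior probability and
# a set of (assumed independent) likelihood ratios. Values are hypothetical.
def posterior_probability(prior: float, *likelihood_ratios: float) -> float:
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)

# e.g. a prior of 0.10, a tumor LR of ~12 and a segregation LR of 3
# (all illustrative) give a posterior of 0.8.
print(round(posterior_probability(0.10, 12.0, 3.0), 3))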


Subject(s)
Colonic Neoplasms/genetics , Computational Biology/methods , DNA Mismatch Repair/genetics , Mutation , Adaptor Proteins, Signal Transducing/genetics , Alternative Splicing/genetics , Computational Biology/classification , Computational Biology/statistics & numerical data , DNA Mutational Analysis/methods , DNA Mutational Analysis/statistics & numerical data , DNA-Binding Proteins/genetics , Family Health , Humans , Likelihood Functions , Microsatellite Instability , Microsatellite Repeats/genetics , MutL Protein Homolog 1 , MutS Homolog 2 Protein/genetics , Nuclear Proteins/genetics , Proto-Oncogene Proteins B-raf/genetics , Registries/classification , Registries/statistics & numerical data
11.
Hum Mutat ; 34(1): 255-65, 2013 Jan.
Article in English | MEDLINE | ID: mdl-22949387

ABSTRACT

Classification of rare missense substitutions observed during genetic testing for patient management is a considerable problem in clinical genetics. The Bayesian integrated evaluation of unclassified variants is a solution originally developed for BRCA1/2. Here, we take a step toward an analogous system for the mismatch repair (MMR) genes (MLH1, MSH2, MSH6, and PMS2) that confer colon cancer susceptibility in Lynch syndrome by calibrating in silico tools to estimate prior probabilities of pathogenicity for MMR gene missense substitutions. A qualitative five-class classification system was developed and applied to 143 MMR missense variants. This identified 74 missense substitutions suitable for calibration. These substitutions were scored using six different in silico tools (Align-Grantham Variation Grantham Deviation, multivariate analysis of protein polymorphisms [MAPP], MutPred, PolyPhen-2.1, Sorting Intolerant From Tolerant, and Xvar), using curated MMR multiple sequence alignments where possible. The output from each tool was calibrated by regression against the classifications of the 74 missense substitutions; these calibrated outputs are interpretable as prior probabilities of pathogenicity. MAPP was the most accurate tool, and MAPP + PolyPhen-2.1 provided the best combined model (R² = 0.62 and area under the receiver operating characteristic curve = 0.93). The MAPP + PolyPhen-2.1 output is sufficiently predictive to feed as a continuous variable into the quantitative Bayesian integrated evaluation for clinical classification of MMR gene missense substitutions.


Subject(s)
Computational Biology/methods , DNA Mismatch Repair/genetics , Genetic Predisposition to Disease/genetics , Mutation, Missense , Adaptor Proteins, Signal Transducing/genetics , Adenosine Triphosphatases/genetics , Bayes Theorem , Calibration , Colorectal Neoplasms, Hereditary Nonpolyposis/genetics , Computational Biology/classification , Computational Biology/standards , DNA Repair Enzymes/genetics , DNA-Binding Proteins/genetics , Humans , Mismatch Repair Endonuclease PMS2 , MutL Protein Homolog 1 , MutS Homolog 2 Protein/genetics , Nuclear Proteins/genetics , Regression Analysis , Reproducibility of Results
13.
Adv Exp Med Biol ; 736: 617-43, 2012.
Article in English | MEDLINE | ID: mdl-22161356

ABSTRACT

One of the central problems of cancer systems biology is to understand the complex molecular changes of cancerous cells and tissues, and to use this understanding to support the development of new targeted therapies. EPoC (Endogenous Perturbation analysis of Cancer) is a network modeling technique for tumor molecular profiles. EPoC models are constructed from combined copy number aberration (CNA) and mRNA data and aim to (1) identify genes whose copy number aberrations significantly affect target mRNA expression and (2) generate markers for long- and short-term survival of cancer patients. Models are constructed by a combination of regression and bootstrapping methods. Prognostic scores are obtained from a singular value decomposition of the networks. We have previously analyzed the performance of EPoC using glioblastoma data from The Cancer Genome Atlas (TCGA) consortium, and have shown that the resulting network models contain both known and candidate disease-relevant genes as network hubs and uncover predictors of patient survival. Here, we give a practical guide to performing EPoC modeling in R and present a set of alternative modeling frameworks.
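As a rough illustration of the SVD-based scoring idea mentioned above, the sketch below projects patient copy-number profiles onto the leading singular vector of a network coefficient matrix to obtain one score per patient. It is written in Python rather than the R workflow the chapter describes, and all matrices are random placeholders.

```python
# Illustrative sketch: a prognostic score from the SVD of a network matrix.
# The network and patient profiles are random placeholders.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 100))            # stand-in network (genes x genes)
cna_profiles = rng.normal(size=(10, 100))  # stand-in CNA data (patients x genes)

U, s, Vt = np.linalg.svd(A)
# Project each patient's copy-number profile onto the leading right singular
# vector to obtain a single score per patient.
scores = cna_profiles @ Vt[0]
print(scores.shape)  # (10,)
```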


Subject(s)
Computational Biology/methods , Gene Regulatory Networks/genetics , Models, Genetic , Neoplasms/genetics , Systems Biology/methods , Algorithms , Computational Biology/classification , Gene Dosage , Gene Expression Regulation, Neoplastic , Gene Regulatory Networks/drug effects , Genetic Predisposition to Disease/genetics , Glioblastoma/drug therapy , Glioblastoma/genetics , Humans , Neoplasms/drug therapy , Prognosis , Reproducibility of Results , Survival Analysis
14.
Rev. colomb. biotecnol ; 13(2): 84-96, Dec 1, 2011. tab, graf
Article in Spanish | LILACS | ID: lil-645170

ABSTRACT



The Pseudomonas fluorescens strain IBUN S1602 belongs to a group of isolates from Colombian sugarcane soils that accumulate the biopolymer polyhydroxyalkanoate (PHA); it was selected as promising for commercial scale-up because of its affinity for economical, alternative substrates such as glycerol, used oils and whey, among others. Given the importance of the synthase enzyme in PHA synthesis, a molecular analysis was performed on the phaC1 and phaC2 genes, which encode the type II synthases (PhaC1 and PhaC2). To obtain the amplimers required for sequencing, PCR was used under standardized conditions with primers designed from sequences reported in databases. Two fragments of 1680 bp and 1683 bp were identified, corresponding to phaC1 and phaC2. Comparative analysis of the protein sequences encoded by these genes shows that the IBUN S1602 synthases contain the α/β-hydrolase region and eight conserved amino acid residues that are characteristic of synthases examined worldwide. The enzyme structure was analyzed at the primary level and the secondary structure was predicted. It is concluded that the Pseudomonas fluorescens IBUN S1602 synthases show high homology with the type II synthases reported for Pseudomonas. The results contribute to a basic understanding of PHA biosynthesis and should make it possible, in the future, to improve PHA quality by modulating the level of synthase expressed in a recombinant organism in order to vary the biopolymer's molecular weight, an essential property for industrial applications.


Subject(s)
Biopolymers/administration & dosage , Biopolymers/biosynthesis , Biopolymers/classification , Biopolymers/immunology , Computational Biology/classification , Computational Biology/history , Computational Biology/instrumentation , Computational Biology/trends
15.
PLoS One ; 6(10): e26146, 2011.
Article in English | MEDLINE | ID: mdl-22022543

ABSTRACT

Integrating gene regulatory networks (GRNs) into the classification of DNA microarrays is an important issue in bioinformatics, both because this information is of genuine biological interest and because it helps in interpreting the final classifier. We present a method called graph-constrained discriminant analysis (gCDA), which aims to integrate the information contained in one or several GRNs into a classification procedure. We show that when the integrated graph includes erroneous information, gCDA's performance is only slightly degraded, demonstrating robustness to misspecification of the given GRNs. The gCDA framework also allows the classification process to take into account as many a priori graphs as there are classes in the dataset. The gCDA procedure was applied to simulated data and to three publicly available microarray datasets, where it performed competitively when compared with state-of-the-art classification methods. The software package gcda, along with the real datasets used in this study, is available online: http://biodev.cea.fr/gcda/.
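One common way to let a prior network constrain a linear classifier is through a graph-Laplacian penalty on the discriminant weights, so that connected genes are encouraged to receive similar weights. The sketch below shows this generic device on a toy network; it is an assumption-laden illustration, not gCDA's exact formulation.

```python
# Generic graph-Laplacian penalty for network-constrained classification.
# The adjacency matrix below is a toy gene regulatory network.
import numpy as np

A = np.array([[0, 1, 0, 0],      # toy GRN over 4 genes
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A   # graph Laplacian

w = np.array([1.0, 0.8, -0.5, -0.6])  # hypothetical discriminant weights
# The quadratic form w' L w is small when connected genes receive similar
# weights, so adding it to a loss encourages the classifier to respect the graph.
print(float(w @ L @ w))
```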


Subject(s)
Computational Biology/classification , Computational Biology/methods , Discriminant Analysis , Gene Regulatory Networks/genetics , Algorithms , Computer Simulation , Databases, Genetic , Gene Expression Regulation, Neoplastic , Humans , Oligonucleotide Array Sequence Analysis , Software
16.
Arch Toxicol ; 85(9): 1015-33, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21523460

ABSTRACT

Thanks to the confluence of genome sequencing and bioinformatics, the number of metabolic databases has expanded from a handful in the mid-1990s to several thousand today. These databases lie within distinct families that have common ancestry and common attributes. The main families are the MetaCyc, KEGG, Reactome, Model SEED, and BiGG families. We survey these database families, as well as important individual metabolic databases, including multiple human metabolic databases. The MetaCyc family is described in particular detail. It contains well over 1,000 databases, including highly curated databases for Escherichia coli, Saccharomyces cerevisiae, Mus musculus, and Arabidopsis thaliana. These databases are available through a number of web sites that offer a range of software tools for querying and visualizing metabolic networks. These web sites also provide multiple tools for analysis of gene expression and metabolomics data, including visualization of those datasets on metabolic network diagrams and over-representation analysis of gene sets and metabolite sets.


Subject(s)
Computational Biology , Databases, Factual , Metabolic Networks and Pathways , Computational Biology/classification , Computational Biology/standards , Databases, Factual/classification , Databases, Factual/standards , Databases, Genetic , Enzymes/metabolism , Information Storage and Retrieval/methods , Internet , Software , User-Computer Interface
17.
PLoS One ; 6(2): e17191, 2011 Feb 16.
Article in English | MEDLINE | ID: mdl-21359184

ABSTRACT

BACKGROUND: The support vector machine (SVM) has been widely used as an accurate and reliable method to decipher brain patterns from functional MRI (fMRI) data. Previous studies have not found a clear benefit for non-linear (polynomial kernel) SVM over linear SVM. Here, a more effective non-linear SVM using a radial basis function (RBF) kernel is compared with linear SVM. Unlike traditional studies, which focused either merely on the evaluation of different types of SVM or on voxel selection methods, we aimed to investigate the overall effect of linear and RBF SVM for fMRI classification, together with voxel selection schemes, on classification accuracy and computation time. METHODOLOGY/PRINCIPAL FINDINGS: Six different voxel selection methods were employed to decide which voxels of the fMRI data would be included in SVM classifiers with linear and RBF kernels in a 4-category object classification task. The overall performance of the voxel selection and classification methods was then compared. Results showed that: (1) voxel selection had an important impact on classification accuracy: in a relatively low-dimensional feature space, RBF SVM significantly outperformed linear SVM, whereas in a relatively high-dimensional space, linear SVM performed better than its counterpart; (2) considering classification accuracy and computation time together, linear SVM with relatively more voxels as features and RBF SVM with a small set of voxels (after PCA) achieved better accuracy in less time. CONCLUSIONS/SIGNIFICANCE: The present work provides the first empirical comparison of linear and RBF SVM for the classification of fMRI data combined with voxel selection methods. Based on the findings, if only classification accuracy is of concern, RBF SVM with an appropriately small set of voxels and linear SVM with relatively more voxels are two suggested solutions; if computation time is a greater concern, RBF SVM with a relatively small set of voxels, keeping part of the principal components as features, is the better choice.
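The sketch below illustrates the comparison described above with synthetic data: linear versus RBF-kernel SVM, with PCA as a stand-in for voxel selection and dimension reduction. It is a generic illustration, not the study's pipeline or data.

```python
# Linear vs RBF-kernel SVM with PCA-based dimension reduction.
# Synthetic data stand in for the fMRI voxels.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))          # 200 samples x 500 "voxels"
y = rng.integers(0, 4, size=200)         # 4 stimulus categories

for kernel in ("linear", "rbf"):
    pipe = make_pipeline(PCA(n_components=30), SVC(kernel=kernel))
    acc = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{kernel}: mean CV accuracy = {acc:.2f}")  # ~chance on random data
```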


Subject(s)
Algorithms , Electronic Data Processing/methods , Magnetic Resonance Imaging/methods , Pattern Recognition, Automated/methods , Software , Brain Mapping/classification , Brain Mapping/methods , Brain Mapping/statistics & numerical data , Computational Biology/classification , Computational Biology/methods , Computational Biology/statistics & numerical data , Electronic Data Processing/classification , Female , Humans , Male , Nonlinear Dynamics , Pattern Recognition, Automated/classification , Reproducibility of Results , Software/classification
19.
Brief Bioinform ; 10(5): 537-46, 2009 Sep.
Article in English | MEDLINE | ID: mdl-19346320

ABSTRACT

Recent advances in high-throughput technology have accelerated interest in the development of molecular biomarker classifiers for safety assessment, disease diagnostics and prognostics, and prediction of response for patient assignment. This article reviews and evaluates some important aspects and key issues in the development of biomarker classifiers. Development of a biomarker classifier for high-throughput data involves two components: (i) model building and (ii) performance assessment. This article focuses on feature selection in model building and cross-validation for performance assessment. A 'frequency' approach to feature selection is presented and compared with the 'conventional' approach in terms of the predictive accuracy and stability of the selected feature set. The two approaches are compared on the basis of four biomarker classifiers, each with a different feature selection method and a well-known classification algorithm. In each of the four classifiers, the feature predictor set selected by the frequency approach is more stable than the feature set selected by the conventional approach.
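The sketch below gives one plausible reading of a frequency-style feature selection: features are scored within each cross-validation fold and only those selected in most folds are retained. It is a generic illustration under that assumption, not the article's exact recipe; the data are random.

```python
# Frequency-style feature selection across cross-validation folds:
# keep only features selected in most folds. Data are random placeholders.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))          # 100 samples x 200 features
y = rng.integers(0, 2, size=100)

counts = np.zeros(X.shape[1], dtype=int)
for train_idx, _ in StratifiedKFold(n_splits=5).split(X, y):
    selector = SelectKBest(f_classif, k=20).fit(X[train_idx], y[train_idx])
    counts += selector.get_support()     # boolean mask of selected features

stable_features = np.where(counts >= 4)[0]   # selected in >= 4 of 5 folds
print(len(stable_features), "stable features")
```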


Subject(s)
Algorithms , Biomarkers , Computational Biology , Models, Biological , Computational Biology/classification , Computational Biology/methods , Databases, Genetic , Mathematics , Reproducibility of Results
20.
ChemMedChem ; 4(7): 1174-81, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19384901

ABSTRACT

Fragment formal concept analysis (FragFCA) for compound classification: Signature fragment combinations for compound classes with closely related biological activity were identified using FragFCA. These combinations are used to accurately classify active test compounds on the basis of fragment mapping. FragFCA can extract class-specific fragment combinations from compounds active against different target families that have signature character and practical utility in compound classification and database searching.

Formal concept analysis (FCA), originally developed in information science, has been adapted to identify relationships between fragments of compounds and their biological activity. Here, applications of the FragFCA approach with practical utility for medicinal chemistry are explored. Hierarchically derived fragment populations of 24 classes of compounds active against eight target families were subjected to FragFCA analysis, and fragment combinations were identified that distinguished compounds with closely related biological activity from each other. Mapping of signature fragment combinations was carried out to classify active compounds for different target families with high accuracy. The results indicate that compound-class-specific structural information and selectivity determinants are predominantly encoded by fragment combinations, rather than individual fragments. Furthermore, class-specific fragment combinations were successfully applied in similarity searching. The results demonstrate that FragFCA is capable of identifying fragment combinations that differentiate between compound sets with closely related biological activity and that can be used to predict structure-activity relationships.
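As a hedged sketch of the underlying idea, the code below enumerates fragment pairs that occur in compounds of one activity class but in no compound of the other classes. The fragment sets are invented placeholders, and this is a strong simplification of formal concept analysis, not the FragFCA algorithm itself.

```python
# Toy search for class-specific fragment combinations (pairs).
# Fragment sets are invented; not the FragFCA algorithm.
from itertools import combinations

classes = {
    "class_A": [{"F1", "F2", "F3"}, {"F1", "F2", "F4"}],
    "class_B": [{"F1", "F3"}, {"F2", "F5"}],
}

def signature_pairs(target: str) -> set:
    """Fragment pairs present in the target class but in no other compound."""
    others = [frags for name, cpds in classes.items() if name != target
              for frags in cpds]
    sigs = set()
    for frags in classes[target]:
        for pair in combinations(sorted(frags), 2):
            pair = frozenset(pair)
            if not any(pair <= other for other in others):
                sigs.add(pair)
    return sigs

print(signature_pairs("class_A"))  # e.g. contains frozenset({'F1', 'F2'})
```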


Subject(s)
Combinatorial Chemistry Techniques/methods , Computational Biology/methods , Drug Design , Combinatorial Chemistry Techniques/classification , Computational Biology/classification , Databases, Factual , Pharmaceutical Preparations/classification , Structure-Activity Relationship